Tutorial on Poincaré Embeddings

This notebook discusses the basic ideas and use-cases of Poincaré embeddings and demonstrates the kinds of operations that can be performed with them. For more comprehensive technical details and results, the accompanying blog post may be a more appropriate resource.

1. Introduction

1.1 Concept and use-case

Poincaré embeddings are a method to learn vector representations of nodes in a graph. The input data takes the form of a list of relations (edges) between nodes, and the model learns representations such that the distances between the vectors reflect the relationships between the corresponding nodes.

The learnt embeddings capture notions of both hierarchy and similarity - similarity by placing connected nodes close to each other and unconnected nodes far from each other; hierarchy by placing nodes lower in the hierarchy farther from the origin, i.e. with higher norms.

The paper uses this model to learn embeddings of nodes in the WordNet noun hierarchy and evaluates them on three tasks - reconstruction, link prediction and lexical entailment - which are described in the section on evaluation. We have compared the results of our Poincaré model implementation on these tasks to those of other open-source implementations and to the results reported in the paper.

The paper also describes a variant of the Poincaré model to learn embeddings of nodes in a symmetric graph, unlike the WordNet noun hierarchy, which is directed and asymmetric. The datasets used in the paper for this model are scientific collaboration networks, in which the nodes are researchers and an edge represents that the two researchers have co-authored a paper.

This variant has not been implemented yet, and is therefore not a part of our tutorial and experiments.

1.2 Motivation

The main innovation here is that these embeddings are learnt in hyperbolic space, as opposed to the commonly used Euclidean space. The reason is that hyperbolic space is much better suited to capturing the hierarchical information inherently present in a graph; embedding nodes into Euclidean space while preserving the distances between them usually requires a very high number of dimensions. A simple illustration: consider a small tree in which node A is the root, D is a child of A, and H is a child of D with several children of its own.

If the positions of the nodes are taken as their vectors in 2-D Euclidean space, then ideally the distance between the vectors for A and D should be the same as that between D and H, and the same as that between H and each of its children. Similarly, all the children of H should be equally far away from node A. It becomes progressively harder to preserve these distances accurately in Euclidean space as the degree and depth of the tree grow. Hierarchical structures may also have cross-connections (effectively making them directed graphs rather than trees), which makes this even harder.

There is no arrangement of this simple tree in 2-dimensional Euclidean space that reflects these distances correctly. This can be addressed by adding more dimensions, but that becomes computationally infeasible as the number of required dimensions grows exponentially. Hyperbolic space is a metric space in which the shortest paths between points (geodesics) are not Euclidean straight lines but curves, and this extra "room" allows tree-like hierarchical structures to be represented accurately even in low dimensions.
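
For reference, the distance function on the Poincaré ball used in the paper (for points $u$ and $v$ with Euclidean norm strictly less than 1) is

$$d(u, v) = \operatorname{arcosh}\left(1 + 2\,\frac{\lVert u - v \rVert^2}{(1 - \lVert u \rVert^2)(1 - \lVert v \rVert^2)}\right)$$

As points approach the boundary of the unit ball, the denominator shrinks and distances grow rapidly, which is what gives the space the extra room needed to embed trees faithfully in few dimensions.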

2. Training the embedding


In [1]:
%cd ../..


/home/misha/git/gensim

In [2]:
# %load_ext autoreload   
# %autoreload 2

import os
import logging
import numpy as np

from gensim.models.poincare import PoincareModel, PoincareKeyedVectors, PoincareRelations

logging.basicConfig(level=logging.INFO)

poincare_directory = os.path.join(os.getcwd(), 'docs', 'notebooks', 'poincare')
data_directory = os.path.join(poincare_directory, 'data')
wordnet_mammal_file = os.path.join(data_directory, 'wordnet_mammal_hypernyms.tsv')

The model can be initialized using an iterable of relations, where a relation is simply a pair of nodes -


In [3]:
model = PoincareModel(train_data=[('node.1', 'node.2'), ('node.2', 'node.3')])


INFO:gensim.models.poincare:loading relations from train data..
INFO:gensim.models.poincare:loaded 2 relations from train data, 3 nodes

The model can also be initialized from a delimiter-separated file (such as a csv or tsv) containing one relation per line. The module provides the convenience class PoincareRelations to do so.


In [4]:
relations = PoincareRelations(file_path=wordnet_mammal_file, delimiter='\t')
model = PoincareModel(train_data=relations)


INFO:gensim.models.poincare:loading relations from train data..
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
INFO:gensim.models.poincare:loaded 7724 relations from train data, 1182 nodes

Note that the above only initializes the model and does not begin training. To train the model -


In [5]:
model = PoincareModel(train_data=relations, size=2, burn_in=0)
model.train(epochs=1, print_every=500)


INFO:gensim.models.poincare:loading relations from train data..
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
INFO:gensim.models.poincare:loaded 7724 relations from train data, 1182 nodes
INFO:gensim.models.poincare:training model of size 2 with 1 workers on 7724 relations for 1 epochs and 0 burn-in epochs, using lr=0.10000 burn-in lr=0.01000 negative=10
INFO:gensim.models.poincare:starting training (1 epochs)----------------------------------------
INFO:gensim.models.poincare:training on epoch 1, examples #4990-#5000, loss: 23.57
INFO:gensim.models.poincare:time taken for 5000 examples: 0.69 s, 7268.98 examples / s
INFO:gensim.models.poincare:training finished

The same model can be trained further for more epochs if the user decides that it hasn't converged yet.


In [6]:
model.train(epochs=1, print_every=500)


INFO:gensim.models.poincare:training model of size 2 with 1 workers on 7724 relations for 1 epochs and 0 burn-in epochs, using lr=0.10000 burn-in lr=0.01000 negative=10
INFO:gensim.models.poincare:starting training (1 epochs)----------------------------------------
INFO:gensim.models.poincare:training on epoch 1, examples #4990-#5000, loss: 22.37
INFO:gensim.models.poincare:time taken for 5000 examples: 0.67 s, 7412.15 examples / s
INFO:gensim.models.poincare:training finished

The model can be saved and loaded using two different methods -


In [7]:
# Saves the entire PoincareModel instance, the loaded model can be trained further
model.save('/tmp/test_model')
PoincareModel.load('/tmp/test_model')


INFO:gensim.utils:saving PoincareModel object under /tmp/test_model, separately None
INFO:gensim.utils:not storing attribute _node_probabilities
INFO:gensim.utils:not storing attribute _node_counts_cumsum
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
INFO:gensim.utils:saved /tmp/test_model
INFO:gensim.utils:loading PoincareModel object from /tmp/test_model
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
INFO:gensim.utils:loading kv recursively from /tmp/test_model.kv.* with mmap=None
INFO:gensim.utils:setting ignored attribute _node_probabilities to None
INFO:gensim.utils:setting ignored attribute _node_counts_cumsum to None
INFO:gensim.utils:loaded /tmp/test_model
Out[7]:
<gensim.models.poincare.PoincareModel at 0x7f354560b860>

In [8]:
# Saves only the vectors from the PoincareModel instance, in the commonly used word2vec format
model.kv.save_word2vec_format('/tmp/test_vectors')
PoincareKeyedVectors.load_word2vec_format('/tmp/test_vectors')


INFO:gensim.models.utils_any2vec:storing 1182x2 projection weights into /tmp/test_vectors
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
INFO:gensim.models.utils_any2vec:loading projection weights from /tmp/test_vectors
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
INFO:gensim.models.utils_any2vec:loaded (1182, 2) matrix from /tmp/test_vectors
Out[8]:
<gensim.models.poincare.PoincareKeyedVectors at 0x7f3545623b38>

3. What the embedding can be used for


In [9]:
# Load an example model
models_directory = os.path.join(poincare_directory, 'models')
test_model_path = os.path.join(models_directory, 'gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50')
model = PoincareModel.load(test_model_path)


INFO:gensim.utils:loading PoincareModel object from /home/misha/git/gensim/docs/notebooks/poincare/models/gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50
WARNING:smart_open.smart_open_lib:this function is deprecated, use smart_open.open instead
FileNotFoundError: [Errno 2] No such file or directory: '/home/misha/git/gensim/docs/notebooks/poincare/models/gensim_model_batch_size_10_burn_in_0_epochs_50_neg_20_dim_50'

(Note: the pre-trained example model was not available at the expected path when this notebook was last run, so this cell raises a FileNotFoundError. The remaining cells in this section assume such a model has been loaded successfully and are therefore shown without output.)

The learnt representations can be used to perform various kinds of useful operations. This section is split into two parts: some simple operations that are directly described in the paper, and some experimental operations that are only hinted at and may require more work to refine.

The models used in this section have been trained on the transitive closure of the WordNet hypernym graph. The transitive closure contains all the direct and indirect hypernym pairs in the WordNet graph. An example of a direct hypernym pair is (seat.n.03, furniture.n.01), while an example of an indirect hypernym pair is (seat.n.03, physical_entity.n.01).
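
The following is a minimal sketch (not part of gensim) of how a set of direct hypernym edges can be expanded into its transitive closure; the toy edges below are purely illustrative.

In [ ]:
# Minimal sketch (not part of gensim): expand direct hypernym edges into their
# transitive closure, i.e. all direct and indirect (node, ancestor) pairs.
direct_hypernyms = [
    ('seat.n.03', 'furniture.n.01'),
    ('furniture.n.01', 'physical_entity.n.01'),
]

def transitive_closure(edges):
    parents = {}
    for child, parent in edges:
        parents.setdefault(child, set()).add(parent)
    closure = set()
    for child in parents:
        stack = list(parents[child])
        while stack:
            ancestor = stack.pop()
            if (child, ancestor) not in closure:
                closure.add((child, ancestor))
                stack.extend(parents.get(ancestor, ()))
    return closure

transitive_closure(direct_hypernyms)
# -> {('seat.n.03', 'furniture.n.01'),
#     ('furniture.n.01', 'physical_entity.n.01'),
#     ('seat.n.03', 'physical_entity.n.01')}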

3.1 Simple operations

All the following operations are based simply on the notion of distance between two nodes in hyperbolic space.


In [ ]:
# Distance between any two nodes
model.kv.distance('plant.n.02', 'tree.n.01')

In [ ]:
model.kv.distance('plant.n.02', 'animal.n.01')

In [ ]:
# Nodes most similar to a given input node
model.kv.most_similar('electricity.n.01')

In [ ]:
model.kv.most_similar('man.n.01')

In [ ]:
# Nodes that are closer to node 1 than node 2 is to node 1
model.kv.nodes_closer_than('dog.n.01', 'carnivore.n.01')

In [ ]:
# Rank of distance of node 2 from node 1 in relation to distances of all nodes from node 1
model.kv.rank('dog.n.01', 'carnivore.n.01')

In [ ]:
# Computing the Poincare distance between input vectors directly
# (points in the Poincare ball must have Euclidean norm < 1, so the random
# vectors are scaled to lie inside the unit ball)
vector_1 = np.random.uniform(size=(100,)) * 0.1
vector_2 = np.random.uniform(size=(100,)) * 0.1
vectors_multiple = np.random.uniform(size=(5, 100)) * 0.1

# Distance between vector_1 and vector_2
print(PoincareKeyedVectors.vector_distance(vector_1, vector_2))
# Distance between vector_1 and each vector in vectors_multiple
print(PoincareKeyedVectors.vector_distance_batch(vector_1, vectors_multiple))
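
For intuition, the value returned by these helpers should match a direct NumPy implementation of the Poincaré distance formula from the introduction. This is a minimal sketch, not part of gensim's API, and it assumes both points lie strictly inside the unit ball (as the scaled random vectors above do).

In [ ]:
# Minimal NumPy sketch of the Poincare ball distance (not gensim's API);
# assumes both points have Euclidean norm strictly less than 1.
def poincare_distance(u, v):
    squared_diff = np.sum((u - v) ** 2)
    denominator = (1 - np.sum(u ** 2)) * (1 - np.sum(v ** 2))
    return np.arccosh(1 + 2 * squared_diff / denominator)

poincare_distance(vector_1, vector_2)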

3.2 Experimental operations

These operations are based on the notion that the norm of a vector represents its position in the hierarchy. Leaf nodes typically have the highest norms, and the norm decreases as we move up the hierarchy, with the root node lying close to the origin.


In [ ]:
# Closest child node
model.kv.closest_child('person.n.01')

In [ ]:
# Closest parent node
model.kv.closest_parent('person.n.01')

In [ ]:
# Position in the hierarchy - a lower norm indicates that the node is higher in the hierarchy
print(model.kv.norm('person.n.01'))
print(model.kv.norm('teacher.n.01'))

In [ ]:
# Difference in hierarchy between the first node and the second node
# Positive values indicate the first node is higher in the hierarchy
print(model.kv.difference_in_hierarchy('person.n.01', 'teacher.n.01'))

In [ ]:
# One possible descendant chain
model.kv.descendants('mammal.n.01')

In [ ]:
# One possible ancestor chain
model.kv.ancestors('dog.n.01')

Note that the chains are not symmetric: while descending recursively to the closest child, starting from mammal, the closest child of carnivore is dog; however, while ascending from dog to the closest parent, the closest parent of dog is canine.

This is despite the fact that the Poincaré distance is symmetric (like any distance in a metric space). The asymmetry stems from the fact that even if node Y is the closest node to node X among all nodes with a higher norm (lower in the hierarchy) than X, node X may not be the closest node to node Y among all nodes with a lower norm (higher in the hierarchy) than Y, since the two candidate sets are different.
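
The following is a hypothetical sketch of this logic, not gensim's actual implementation: candidates are filtered by norm (higher norm for a child, lower norm for a parent), and the nearest candidate by Poincaré distance is returned. It operates on a plain {node: vector} dict and reuses the poincare_distance sketch from earlier.

In [ ]:
# Hypothetical sketch of the closest-child / closest-parent logic
# (not gensim's actual implementation), using a plain {node: vector} dict.
def closest_child_sketch(vectors, node):
    # Candidates lower in the hierarchy: higher norm than the query node.
    query_norm = np.linalg.norm(vectors[node])
    candidates = [
        other for other, vec in vectors.items()
        if other != node and np.linalg.norm(vec) > query_norm
    ]
    return min(candidates, key=lambda other: poincare_distance(vectors[node], vectors[other]))

def closest_parent_sketch(vectors, node):
    # Candidates higher in the hierarchy: lower norm than the query node.
    query_norm = np.linalg.norm(vectors[node])
    candidates = [
        other for other, vec in vectors.items()
        if other != node and np.linalg.norm(vec) < query_norm
    ]
    return min(candidates, key=lambda other: poincare_distance(vectors[node], vectors[other]))

Because the candidate set for descending (higher-norm nodes) differs from the candidate set for ascending (lower-norm nodes), following the closest child from X to Y and then the closest parent from Y need not lead back to X.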